<<BACK

(Remi/David) Now, if you remember what we did yesterday, we loaded and cleaned my larval abundance data, but I did not follow the principles of ‘tidy’ data when I uploaded my data.

Again, this data comes from one of Remi’s PhD thesis chapters1 that was part of the first CHONe. We cleaned in the Data cleaning and raw data management and is also archived on Dryad so we would have cleaning to do!

Reminder: The ‘tidy’ concept is something we will revisit in the R section, but the rules of tidy data are2:

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each “kind” of variable
  4. If you have multiple tables, they should include a column in the table that allows them to be linked

While it was certainly easier to enter the abundance data in a spreadsheet by giving each species a column, this does not abide by the ‘tidy’ rules and we will have a much easier time analysing and plotting our data if we first ‘tidy’ the data.

# let's load all packages and the data before we begin
library(sf)
library(raster)
library(marmap)
library(robis)
library(tidyverse)

larvalAbundance <- read.csv("data/larvalAbundanceClean.csv", stringsAsFactors = FALSE)

head(larvalAbundance)
##   year month day     time depth locationID decimalLatitude
## 1 2008     8   7 10:40:00     3          5        45.75782
## 2 2008     8   8  9:05:00    12          5        45.75782
## 3 2008     8   7 12:14:00    12          6        45.77987
## 4 2008     8   7 13:47:00    12          7        45.80723
## 5 2008     8   7 15:28:00    12          8        45.82607
## 6 2008     8   7 17:05:00    12          9        45.73428
##   decimalLongitude Astyris.lunata Bittiolum.alternatum Margarites.spp.
## 1        -61.87513      4.4783154            0.3731930        9.516420
## 2        -61.87513     12.5483524            0.0000000        5.377865
## 3        -61.79735     31.8337216            3.9300891       11.135252
## 4        -61.69652     21.1619877            1.9534143        4.883536
## 5        -61.58807      0.3244847            0.3785655        1.027535
## 6        -61.62138     23.5602998            0.2926745        4.682793
##   Arrhoges.occidentalis Diaphana.minuta Crepidula.spp. Other.Gastropods
## 1            0.37319295       1.6793683     4.47831542        1.1195789
## 2            0.71704871       0.8963109     7.70827362        1.7926218
## 3            0.78601782       4.9781128    22.66351371        1.7030386
## 4            0.16278452       1.3022762     2.76733685        0.6511381
## 5            0.05408078       0.1352020     0.02704039        0.2433635
## 6            0.14633727       2.6340708     2.63407078        0.7316863
##   Mytilus.spp. Modiolus.modiolus Anomia.simplex Other.Bivalve
## 1    9.5164203         0.1865965      0.5597894     8.7700344
## 2   12.0105659         0.5377865      0.8963109     4.6608166
## 3    5.6331277         0.2620059      0.6550148     1.9650445
## 4    3.7440440         0.3255690      0.4883536     1.3022762
## 5    1.1356964         0.0000000      0.6760098     0.3515251
## 6    0.5853491         0.2926745      0.2926745     0.7316863
##   Electra.pilosa Membranipora.membranacea Carcinus.maenas Cancer.irroratus
## 1      12.501964                0.1865965      0.05937161       0.16963316
## 2       3.943768                0.0000000      0.10508473       0.17308072
## 3      35.501805                0.3930089      0.05240119       2.30565226
## 4      18.557435                0.1627845      0.00000000       1.17204855
## 5       1.027535                0.0000000      0.00000000       0.04506732
## 6      23.852974                0.1463373      0.08780236       0.45364552
##   Neopanopeus.sayi Crangon.septemspinosa
## 1      0.000000000           0.271413056
## 2      0.006181454           0.080358907
## 3      0.000000000           1.074224349
## 4      0.000000000           1.041820933
## 5      0.000000000           0.009013463
## 6      0.000000000           0.936558500

The tidyverse ‘meta’ package

The tidyverse is a package that is made up of other packages commonly used together for data analysis and vizualization. You can find much more detail in the book R for Data Science by Garrett Grolemund and Hadley Wickham. Below we will learn about 3 of them:

  • tidyr for tidying data
  • dplyr for manipulating data
  • ggplot2 for plotting data

The tidyr package

This aptly named package will allow us to tidy our data. Our data is currently in the wide format since we have multiple observations of species abundances in seperate columns. We want our data to be in the long format in order to facilitate analysis and plotting (and to abide by the ‘tidy’ rules).

We currently have a few ID type variable (i.e. they are not observations; year,month,day,time,depth,site,long,lat) and an observational variable for each species.

names(larvalAbundance)
##  [1] "year"                     "month"                   
##  [3] "day"                      "time"                    
##  [5] "depth"                    "locationID"              
##  [7] "decimalLatitude"          "decimalLongitude"        
##  [9] "Astyris.lunata"           "Bittiolum.alternatum"    
## [11] "Margarites.spp."          "Arrhoges.occidentalis"   
## [13] "Diaphana.minuta"          "Crepidula.spp."          
## [15] "Other.Gastropods"         "Mytilus.spp."            
## [17] "Modiolus.modiolus"        "Anomia.simplex"          
## [19] "Other.Bivalve"            "Electra.pilosa"          
## [21] "Membranipora.membranacea" "Carcinus.maenas"         
## [23] "Cancer.irroratus"         "Neopanopeus.sayi"        
## [25] "Crangon.septemspinosa"
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              Astyris.lunata:Crangon.septemspinosa)             # which columns  to gather

# alternatively, I could of specified which columns not to include instead
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              -year,-month,-day,-time,-depth,-locationID,-decimalLongitude,-decimalLatitude)  # which columns not to gather

head(larvalAbundanceLong)                           
##   year month day     time depth locationID decimalLatitude
## 1 2008     8   7 10:40:00     3          5        45.75782
## 2 2008     8   8  9:05:00    12          5        45.75782
## 3 2008     8   7 12:14:00    12          6        45.77987
## 4 2008     8   7 13:47:00    12          7        45.80723
## 5 2008     8   7 15:28:00    12          8        45.82607
## 6 2008     8   7 17:05:00    12          9        45.73428
##   decimalLongitude scientificName larvalAbundanceIndPerm3
## 1        -61.87513 Astyris.lunata               4.4783154
## 2        -61.87513 Astyris.lunata              12.5483524
## 3        -61.79735 Astyris.lunata              31.8337216
## 4        -61.69652 Astyris.lunata              21.1619877
## 5        -61.58807 Astyris.lunata               0.3244847
## 6        -61.62138 Astyris.lunata              23.5602998

We can also clean up and/or separate the scientificName column

# first let's swap multiple species 
larvalAbundanceLong <- separate(larvalAbundanceLong,           #old data frame
                                col = scientificName,          # column to be seperated
                                into = c("genus","species"),   # to be seperated into
                                extra = "merge",               # if there are too many ".", what to do with the extra column
                                remove = FALSE,                # whether to delete the seperated column
                                sep = "\\.")                   # regex for ".", but need escape character "\\" since . is a wildcard

# Let's check out our work
head(larvalAbundanceLong)
##   year month day     time depth locationID decimalLatitude
## 1 2008     8   7 10:40:00     3          5        45.75782
## 2 2008     8   8  9:05:00    12          5        45.75782
## 3 2008     8   7 12:14:00    12          6        45.77987
## 4 2008     8   7 13:47:00    12          7        45.80723
## 5 2008     8   7 15:28:00    12          8        45.82607
## 6 2008     8   7 17:05:00    12          9        45.73428
##   decimalLongitude scientificName   genus species larvalAbundanceIndPerm3
## 1        -61.87513 Astyris.lunata Astyris  lunata               4.4783154
## 2        -61.87513 Astyris.lunata Astyris  lunata              12.5483524
## 3        -61.79735 Astyris.lunata Astyris  lunata              31.8337216
## 4        -61.69652 Astyris.lunata Astyris  lunata              21.1619877
## 5        -61.58807 Astyris.lunata Astyris  lunata               0.3244847
## 6        -61.62138 Astyris.lunata Astyris  lunata              23.5602998
# genus and species don't look terrible, but scientificName still has that pesky period
# we could use gsub again, but the regex is tricky since there are spp. for species
# pasting the genus and species together will be much easier
larvalAbundanceLong$scientificName <- paste(larvalAbundanceLong$genus,larvalAbundanceLong$species,sep = " ")

# also, spp. is not a species designation, let's get rid of them
larvalAbundanceLong$species[larvalAbundanceLong$species=="spp."] <- NA

# Let's check out our work again
head(larvalAbundanceLong)
##   year month day     time depth locationID decimalLatitude
## 1 2008     8   7 10:40:00     3          5        45.75782
## 2 2008     8   8  9:05:00    12          5        45.75782
## 3 2008     8   7 12:14:00    12          6        45.77987
## 4 2008     8   7 13:47:00    12          7        45.80723
## 5 2008     8   7 15:28:00    12          8        45.82607
## 6 2008     8   7 17:05:00    12          9        45.73428
##   decimalLongitude scientificName   genus species larvalAbundanceIndPerm3
## 1        -61.87513 Astyris lunata Astyris  lunata               4.4783154
## 2        -61.87513 Astyris lunata Astyris  lunata              12.5483524
## 3        -61.79735 Astyris lunata Astyris  lunata              31.8337216
## 4        -61.69652 Astyris lunata Astyris  lunata              21.1619877
## 5        -61.58807 Astyris lunata Astyris  lunata               0.3244847
## 6        -61.62138 Astyris lunata Astyris  lunata              23.5602998

The dplyr package

Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Astyris lunata", "larvalAbundanceIndPerm3"])
## [1] 13.85279
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Bittiolum alternatum", "larvalAbundanceIndPerm3"])
## [1] 1.713291
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Arrhoges occidentalis", "larvalAbundanceIndPerm3"])
## [1] 0.7773257

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

Using select()

If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.

yearAndSpecies <- select(larvalAbundanceLong,year,species,genus,larvalAbundanceIndPerm3)

If we open up yearAndSpecies we’ll see that it only contains the year, and the larval abundance for each species.

Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes (%>%). Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

yearTaxonAbun <- larvalAbundanceLong %>% select(year,species,genus,larvalAbundanceIndPerm3)

To help you understand why we wrote that in that way, let’s walk through it step by step. In both cases, the left side of <- is yearTaxonAbun. First we summon the larvalAbundanceLong dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which .data argument we use in the select() function since in gets that from the previous pipe. Essentially, the output of what is written before the pipe becomes the first argument, in this case .data, of the function after the pipe. Fun Fact: There is a chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!

Pro-tip

Did you get an error while running the select() function? Something like: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘select’ for signature ‘"data.frame"’ That’s because many packages have a select() function; In this case, it’s the raster package causing problems. It is possible to get around that problem the ‘hacky’ way by loading tidyverse after all the other packages so that the dplyr (which is in the tidyverse) version of select sort of overwrites the one loaded from raster.

Alternatively, you can designate which package using the :: (e.g. package::function()) when it’s not clear. yearTaxonAbun <- larvalAbundanceLong %>% dplyr::select(year,species,genus,larvalAbundanceIndPerm3)

Using filter()

If we now wanted to move forward with the above, but only with species for which we have a valid genus and species, we can combine select and filter

yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other") %>% 
    filter(!is.na(species))

# we can also string the two filter operations into one using the and symbol '&'

yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other" & !is.na(species))

** Challenge 1 **

Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the values for larvalAbundanceIndPerm3 for all species at locationID 5, but not for other sites. I also want locationID, and lat/long removed from the output. How many rows does your dataframe have and why? Solution

Using group_by()

Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each species. Instead of filter(), which will only pass observations that meet your criteria (in the above: locationID==5), we can use group_by(), which will essentially use every unique criteria that you could have used in filter.

str(larvalAbundanceLong)
## 'data.frame':    1343 obs. of  12 variables:
##  $ year                   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ month                  : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ day                    : int  7 8 7 7 7 7 7 8 8 8 ...
##  $ time                   : chr  "10:40:00" "9:05:00" "12:14:00" "13:47:00" ...
##  $ depth                  : int  3 12 12 12 12 12 12 12 12 12 ...
##  $ locationID             : int  5 5 6 7 8 9 10 11 12 13 ...
##  $ decimalLatitude        : num  45.8 45.8 45.8 45.8 45.8 ...
##  $ decimalLongitude       : num  -61.9 -61.9 -61.8 -61.7 -61.6 ...
##  $ scientificName         : chr  "Astyris lunata" "Astyris lunata" "Astyris lunata" "Astyris lunata" ...
##  $ genus                  : chr  "Astyris" "Astyris" "Astyris" "Astyris" ...
##  $ species                : chr  "lunata" "lunata" "lunata" "lunata" ...
##  $ larvalAbundanceIndPerm3: num  4.478 12.548 31.834 21.162 0.324 ...
str(larvalAbundanceLong %>% group_by(locationID))
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  1343 obs. of  12 variables:
##  $ year                   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ month                  : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ day                    : int  7 8 7 7 7 7 7 8 8 8 ...
##  $ time                   : chr  "10:40:00" "9:05:00" "12:14:00" "13:47:00" ...
##  $ depth                  : int  3 12 12 12 12 12 12 12 12 12 ...
##  $ locationID             : int  5 5 6 7 8 9 10 11 12 13 ...
##  $ decimalLatitude        : num  45.8 45.8 45.8 45.8 45.8 ...
##  $ decimalLongitude       : num  -61.9 -61.9 -61.8 -61.7 -61.6 ...
##  $ scientificName         : chr  "Astyris lunata" "Astyris lunata" "Astyris lunata" "Astyris lunata" ...
##  $ genus                  : chr  "Astyris" "Astyris" "Astyris" "Astyris" ...
##  $ species                : chr  "lunata" "lunata" "lunata" "lunata" ...
##  $ larvalAbundanceIndPerm3: num  4.478 12.548 31.834 21.162 0.324 ...
##  - attr(*, "vars")=List of 1
##   ..$ : symbol locationID
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 16
##   ..$ : int  0 1 12 23 34 45 46 62 63 79 ...
##   ..$ : int  2 13 24 35 47 64 81 92 103 114 ...
##   ..$ : int  3 14 25 36 48 65 82 93 104 115 ...
##   ..$ : int  4 15 26 37 49 66 83 94 105 116 ...
##   ..$ : int  5 16 27 38 50 67 84 95 106 117 ...
##   ..$ : int  6 17 28 39 51 68 85 96 107 118 ...
##   ..$ : int  7 18 29 40 52 69 86 97 108 119 ...
##   ..$ : int  8 19 30 41 53 70 87 98 109 120 ...
##   ..$ : int  9 20 31 42 54 71 88 99 110 121 ...
##   ..$ : int  10 21 32 43 55 72 89 100 111 122 ...
##   ..$ : int  11 22 33 44 56 73 90 101 112 123 ...
##   ..$ : int  57 74 136 153 215 232 294 311 373 390 ...
##   ..$ : int  58 75 137 154 216 233 295 312 374 391 ...
##   ..$ : int  59 76 138 155 217 234 296 313 375 392 ...
##   ..$ : int  60 77 139 156 218 235 297 314 376 393 ...
##   ..$ : int  61 78 140 157 219 236 298 315 377 394 ...
##  - attr(*, "group_sizes")= int  153 102 102 102 102 102 102 102 102 102 ...
##  - attr(*, "biggest_group_size")= int 153
##  - attr(*, "labels")='data.frame':   16 obs. of  1 variable:
##   ..$ locationID: int  5 6 7 8 9 10 11 12 13 14 ...
##   ..- attr(*, "vars")=List of 1
##   .. ..$ : symbol locationID
##   ..- attr(*, "drop")= logi TRUE

You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original larvalAbundanceLong (data.frame). A grouped_df can be thought of as a list where each item in the listis a data.frame which contains only the rows that correspond to the a particular value locationID (at least in the example above).

Using summarize()

The above was a bit on the uneventful side because group_by() much more exciting in conjunction with summarize(). This will allow use to create new variable(s) by using functions that repeat for each of the locationID-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

IndPerm3_byLocationID <- larvalAbundanceLong %>%
    group_by(locationID) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))

That allowed us to calculate the mean larval abundance for each location, but it gets so much better!

** Challenge 2 **

Calculate the mean larval abundance per locationID. Which has the highest mean larval abundance and which has the lowest mean larval abundance? Solution

The function group_by() allows us to group by multiple variables. Let’s group by locationID and scientificName.

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3),
              sd_larvalAbundanceIndPerm3 = sd(larvalAbundanceIndPerm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))

Using mutate()

We can also create new variables prior to (or even after) summarizing information using mutate().

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    mutate(larvalAbundanceIndPercm3=larvalAbundanceIndPerm3*1000000) %>% 
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPercm3 = mean(larvalAbundanceIndPercm3),
              sd_larvalAbundanceIndPercm3 = sd(larvalAbundanceIndPercm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))

Plotting

R comes with basic plotting function, usually referred to as ‘base plot’, e.g:

plot(larvalAbundance$Margarites.spp.,larvalAbundance$Diaphana.minuta)

However, general concensus seems to be that ggplot2 (part of the tidyverse) is the way to go for plotting. I (Remi) have mixed feelings about this package (my age is showing again). It does make making quick plots easier and is compatible with the rest of the tidyverse, but until recently, ggplot2 really struggled with any spatial plotting (see ‘The classic and the cutting edge: sp vs sf’) so I had to know the base plot anyway. I still think base plot is more customizeable, but it may also be that I have not sufficiently mastered the ggplot world. We’re mostly going to cover ggplot2, but be aware that there are other plotting paradigms in R (e.g. base, lattice, etc)

The ggplot2 package

A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”. You provide the data, tell ‘ggplot2’ how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details

So for the same plot as above, let me walk you through each line below step by step. The first line creates a new blank ggplot plot. Some people set aes() and data in ggplot(), but since different layers can take different aes() and data, I like to set those in the geom functions.

Unlike other tidyverse functions, you can tie together ggplot2 functions with + instead of %>%, but otherwise it works the same. So to our blank plot, we’re adding geom_point() which is the function for a scatterplot. We’re telling geom_point() that all the data it needs should be in larvalAbundance and what aesthetic mapping we want (aes()) which sets the variables used in the plot and takes arguments like x, y, colour, size.

Finally, I set the theme as theme_classic because the default plots are ugly! (try it, I dare you) #

ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta)) +
    theme_classic()

Yeah, that was way more code than base plot just to get a simple plot! But now that we have the basic ggplot skeleton, our investment is going to pay off because it’s now super easy to modify. Let’s assign different colours for different depths.

ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta,colour=as.character(depth))) +
    theme_classic()

I had to convert year and depth to characters since ggplot assumes that numeric variables are continous. Let’s combine what we learned above about the tidyverse.

# prepare data for plotting
larvalAbundancePlot <- larvalAbundance %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))

# plot new data
ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    theme_classic()

Let’s add a linear regression and make the axes log scale:

ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    geom_smooth(method="lm")+
    scale_x_log10()+
    scale_y_log10()+
    theme_classic()
## Warning: Transformation introduced infinite values in continuous y-axis

Try commenting out (#) some of those lines to see what they do.

Now that you understand how that works, let’s try a few different plot types

# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))

# plot new data
ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    theme_classic()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 214 rows containing non-finite values (stat_boxplot).

Ready to get your mind blown? By adding 1 line to the above, we can have a plot for each species

ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 214 rows containing non-finite values (stat_boxplot).

Or swapping out the geom gives you a completely different (better) plot.

ggplot() +
    geom_violin(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 214 rows containing non-finite values (stat_ydensity).

Or swapping out the geom again and we can map the abundance.

# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    group_by(scientificName,decimalLongitude,decimalLatitude) %>% 
    summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3))

# plot new mean data
ggplot() +
    geom_point(data=larvalAbundanceLongPlot,aes(x=decimalLongitude,y=decimalLatitude,size=log10(mean_IndPerm3))) +
    facet_wrap(~scientificName)+
    theme_classic()
## Warning in sqrt(x): NaNs produced
## Warning: Removed 13 rows containing missing values (geom_point).

Let’s add a bit more to these maps with some proper mapping tools. But before we move away from ggplot, bookmark this excellent “R Graph Catalog”. You can visually find a graph you want to make and get the code! Super useful!

Mapping

You can always plot any spatial data and use latitude and longitude as normal Cartesian coordinates as we did above, but if you want to do anything with projections, or other GIS type operations (buffer, intersection, overlaps, etc), you really need to dig a little deeper

The classic and the cutting edge: sp vs sf

Before November 2016, if you wanted to do any spatial tasks in R your choice was a simple one. You needed to use the sp package (along with rgeos, and rgdal). Now, the sf package is replacing that set of packages, and that’s a good thing because it allows for faster computation and it is compatible with the tidyverse. However, many of the helper packages that were built to interact with sp have not been upgraded to sf (yet).

The sp package

The sf package

Plot your study site

Define site map limits

Let’s start by defining a bounding box for your study site:

#define the corners
latmax <- 46.25
latmin <- 45.5
lonmax <- -61.25
lonmin <- -62.25

#create a matrix
bbmat <- cbind(
            c(lonmin,lonmax,lonmax,lonmin,lonmin),
            c(latmin,latmin,latmax,latmax,latmin)
        )
# make the matrix a 'simple features' polygon
bbsf <- st_polygon(list(bbmat))
# and let's give it information about the projection:
bbsfc <- st_sfc(bbsf,crs="+proj=longlat +datum=WGS84")
# finally, let's make it a simple features data.frame
bb <- st_sf(name="Study Site",geometry=bbsfc)
str(bb)
## Classes 'sf' and 'data.frame':   1 obs. of  2 variables:
##  $ name    : Factor w/ 1 level "Study Site": 1
##  $ geometry:sfc_POLYGON of length 1; first list element: List of 1
##   ..$ : num [1:5, 1:2] -62.2 -61.2 -61.2 -62.2 -62.2 ...
##   ..- attr(*, "class")= chr  "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "name"
# plot it!
ggplot(bb)+
    geom_sf()

Riveting!

Add a basemap!

There are many ways to add a basemap in R

marmap package

bathymetry <- getNOAA.bathy(lonmin,lonmax,latmin,latmax,resolution=1,keep=TRUE)
## File already exists ; loading 'marmap_coord_-62.25;45.5;-61.25;46.25_res_1.csv'
plot.bathy(bathymetry,image=T,drawlabels=TRUE)

blues <- colorRampPalette(c("darkblue", "lightblue1"))
greys <- colorRampPalette(c(grey(0.4),grey(0.99)))

plot.bathy(bathymetry,
           image = TRUE,
           land = TRUE,
           n=0,
           bpal = list(c(0, max(bathymetry), greys(100)),
                       c(min(bathymetry), 0, blues(100))))

# this is a 'bathy' object and you could plot it as is with the tools provided in marmap, but to keep things 'simple' I'm going to convert to sf. There is no direct method yet, so... don't judge me!
# convert from bathy to raster to 'sp' polygon to simple features
# bathypoly <- marmap::as.raster(bathymetry) %>% 
    # rasterToPolygons() %>% st_as_sf
bathyras <- fortify.bathy(bathymetry)

bathyras$z[bathyras$z>0] <- NA

ggplot() +
    geom_raster(data=bathyras,aes(x=x,y=y,fill=z))+
    geom_sf(data=bb,fill="transparent",colour='black')

Canada <- getData('GADM', country="CAN", level=1) %>% 
    st_as_sf() %>% 
    filter(NAME_1=="Nova Scotia"|
               NAME_1=="Prince Edward Island"|
               NAME_1=="New Brunswick")
ggplot() +
    geom_sf(data=Canada,aes(fill=NAME_1))

x=0.2
ggplot() +
    geom_raster(data=bathyras,aes(x=x,y=y,fill=z))+
    geom_sf(data=bb,fill="transparent",colour='yellow',size=2)+
    geom_sf(data=Canada,fill='grey40')+
    coord_sf(xlim = c(lonmin-x,lonmax+x),
             ylim = c(latmin-x,latmax+x),
             expand = F) 

Get data from OBIS

OBIS is the Ocean Biogeographic Information System and their vision is: “To be the most comprehensive gateway to the world’s ocean biodiversity and biogeographic data and information required to address pressing coastal and world ocean concerns.”

We can get access to their HUGE database through the excellent robis package by ropensci (unsurprisingly, they also have other great open science R packages)

st_as_text(bb$geometry)
OBIS <- occurrence(geometry=st_as_text(bb$geometry))
write.csv(OBIS,'obis.csv',row.names = FALSE)
OBIS <- read.csv("obis.csv")
OBIS <- st_as_sf(OBIS,
                 coords = c("decimalLongitude", "decimalLatitude"),
                 crs="+proj=longlat +datum=WGS84",
                 remove=FALSE) %>% 
    filter(!is.na(species))

ggplot() +
    geom_raster(data=bathyras,aes(x=x,y=y,fill=z))+
    geom_sf(data=bb,fill="transparent",colour='yellow',size=2)+
    geom_sf(data=Canada,fill='grey40')+
    geom_point(data=OBIS,aes(colour=phylum,x=decimalLongitude,y=decimalLatitude,size=2))+
    coord_sf(xlim = c(lonmin-x,lonmax+x),
             ylim = c(latmin-x,latmax+x),
             expand = F)

ggplot(OBIS) +
    geom_raster(data=bathyras,aes(x=x,y=y,fill=z))+
    geom_sf(data=bb,fill="transparent",colour='yellow',size=2)+
    geom_sf(data=Canada,fill='grey40')+
    geom_point(data=OBIS,aes(x=decimalLongitude,y=decimalLatitude))+
    facet_wrap(~phylum)+
    coord_sf(xlim = c(lonmin-x,lonmax+x),
             ylim = c(latmin-x,latmax+x),
             expand = F)

Challenge answers

Challenge 1 solution

site5 <- larvalAbundanceLong %>%
   filter(locationID==5) %>%
   select(-locationID,-decimalLatitude,-decimalLatitude)

Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Challenge 2 solution

IndPerm3_byLocationID <- larvalAbundanceLong %>%
   group_by(locationID) %>%
   summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3)) %>% 
   filter(mean_IndPerm3 == min(mean_IndPerm3) | mean_IndPerm3 == max(mean_IndPerm3))

<<BACK


  1. Daigle RM, Metaxas A, deYoung B (2014) Bay-scale patterns in the distribution, aggregation and spatial variability of larvae of benthic invertebrates. Marine Ecology Progress Series 503:139-156. http://dx.doi.org/10.3354/meps10734

  2. Jeff Leek, The Elements of Data Analytic Style, Leanpub, 2015-03-02